
World war II has been sealed in the history books as one of the most catastrophic, devastation causing events in all of history. Resulting in billions of dollars worth of damages and casualties in the millions, there was no shortage of pure carnage to be seen. Though these numbers are clear and present, it is sometimes important to ponder what it takes for sheer destruction of this magnitude. The data set we analyze here houses a plethora of knowledge and data regarding the destructive capabilities of aircraft during the course of WWII. It houses many important data points such as Aircraft type, Bomb type, target, bomb size, and so much more. It is important to note however, that this data only represents targets and missions conducted by the allied forces, without the inclusion of Russia. This data is not representative of the axis powers contribution to the world's destruction.
In this tutorial, our goal is to provide you with a more colloquial and readable understanding of the historical data collected during the war. The data provided in its raw form is incredibly clunky, laden with missing values and column after column of confusing information. Our tutorial works to tidy this data, as well as to manipulate it and frame it for better understanding and investigation. We hope that after reading our tutorial you walk away with a better understanding of the destruction of WWII, and the sheer amount of bombs that were dropped as a result of the multiple theatres of war. Perhaps with this knowledge you can hold an engaging conversation with your friends, state a fun fact or two, or even write a paper about the tragedy and horror of human ingenuity. If you already knew all of this information, we hope you enjoyed our visual representations, and a short refresher course in destructive history. If this is new to you, rejoice in the fact that we live in a world where this war has ended, and that you need not fear the looming threat of a B17 bomber over head.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import folium
from folium.plugins import HeatMap
from sklearn import tree
from sklearn import metrics
from sklearn.model_selection import *
data = pd.read_csv("data/THOR_WWII_DATA_CLEAN.csv", sep=",", encoding="latin-1") # For some reason the CSV did not like to be loaded without specifiying this encoding
data
/home/froth/.local/lib/python3.8/site-packages/IPython/core/interactiveshell.py:3441: DtypeWarning: Columns (14,48,49,51,52,59,60) have mixed types.Specify dtype option on import or set low_memory=False. exec(code_obj, self.user_global_ns, self.user_ns)
| WWII_ID | MASTER_INDEX_NUMBER | MSNDATE | THEATER | NAF | COUNTRY_FLYING_MISSION | TGT_COUNTRY_CODE | TGT_COUNTRY | TGT_LOCATION | TGT_TYPE | ... | CALLSIGN | ROUNDS_AMMO | SPARES_RETURN_AC | WX_FAIL_AC | MECH_FAIL_AC | MISC_FAIL_AC | TARGET_COMMENT | MISSION_COMMENTS | SOURCE | DATABASE_EDIT_COMMENTS | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | NaN | 8/15/1943 | MTO | 12 AF | USA | 13.0 | ITALY | SPADAFORA | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 4285 | 20028.0 | 2/20/1945 | PTO | 5 AF | USA | NaN | PHILIPPINE ISLANDS | PUERTA PRINCESA | UNIDENTIFIED TARGET | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 3 | NaN | 8/15/1943 | MTO | 12 AF | USA | 13.0 | ITALY | COSENZA | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 4 | NaN | 8/15/1943 | MTO | 12 AF | USA | 13.0 | ITALY | GIOJA TAURO | NaN | ... | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN |
| 4 | 8167 | 14639.0 | 2/23/1945 | PTO | 5 AF | USA | NaN | PHILIPPINE ISLANDS | BALETE PASS | WOODED AREA | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 178276 | 135311 | NaN | 12/14/1942 | PTO | RAAF | AUSTRALIA | NaN | NEW GUINEA | BUNA AREA - 5 DESTROYERS / MOUTH OF KUMUSI RIVER | ENEMY WARSHIPS | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 178277 | 55692 | NaN | 9/18/1940 | MTO | SAAF | SOUTH AFRICA | 24.0 | ETHIOPIA | YAVELLO | AERODROME | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | DUST CLOUDS IN THE MIDDLE EAST | AUSTEN |
| 178278 | 133019 | NaN | 11/17/1942 | PTO | RAAF | AUSTRALIA | NaN | TIMOR | BAUCAU | BUILDINGS | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 178279 | 178741 | NaN | 2/16/1945 | PTO | RNZAF | NEW ZEALAND | NaN | NEW IRELAND | KAVIENG SWEEP | VARIOUS TARGETS | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 178280 | 122546 | NaN | 6/11/1942 | PTO | RAAF | AUSTRALIA | NaN | NaN | NaN | ENEMY MERCHANT SHIP | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
178281 rows × 62 columns
Lets drop some of those unecessary columns that we won't be using in order to make the data a bit more readable.
#
data.drop(['CALLSIGN','ROUNDS_AMMO','SPARES_RETURN_AC','WX_FAIL_AC','MECH_FAIL_AC','MISC_FAIL_AC','TARGET_COMMENT','MISSION_COMMENTS','SOURCE','DATABASE_EDIT_COMMENTS'], axis=1, inplace=True)
data.drop(['AC_LOST','AC_DAMAGED','AC_AIRBORNE','AC_DROPPING','TIME_OVER_TARGET','SIGHTING_METHOD_CODE','SIGHTING_EXPLANATION','BDA'], axis=1, inplace=True)
data.drop(['THEATER','NAF','TGT_COUNTRY_CODE','NUMBER_OF_FRAG','TYPE_OF_FRAG','LBS_FRAG','TONS_OF_FRAG','TOTAL_LBS','TAKEOFF_BASE'], axis=1, inplace=True)
data.drop(['MASTER_INDEX_NUMBER','WWII_ID','TGT_ID','TGT_INDUSTRY_CODE','TGT_INDUSTRY','LBS_HE','TONS_OF_HE','NUMBER_OF_IC','TYPE_OF_IC','LBS_IC','TONS_OF_IC'], axis=1, inplace=True)
data.drop(['UNIT_ID','TGT_PRIORITY_EXPLANATION','ALTITUDE','ALTITUDE_FEET','NUMBER_OF_HE','TYPE_OF_HE'], axis=1, inplace=True)
data.drop(['MDS','MSN_TYPE','TGT_PRIORITY',], axis=1, inplace=True)
data
| MSNDATE | COUNTRY_FLYING_MISSION | TGT_COUNTRY | TGT_LOCATION | TGT_TYPE | SOURCE_LATITUDE | SOURCE_LONGITUDE | LATITUDE | LONGITUDE | AIRCRAFT_NAME | AC_ATTACKING | TOTAL_TONS | TAKEOFF_COUNTRY | TAKEOFF_LATITUDE | TAKEOFF_LONGITUDE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8/15/1943 | USA | ITALY | SPADAFORA | NaN | 38.2166667 | 15.366667 | 38.220000 | 15.370000 | A36 | NaN | 10.00 | SICILY | 37.131022 | 14.321464 |
| 1 | 2/20/1945 | USA | PHILIPPINE ISLANDS | PUERTA PRINCESA | UNIDENTIFIED TARGET | 945 | 11845.0 | 9.750000 | 118.750000 | A20 | 1.0 | NaN | NaN | NaN | NaN |
| 2 | 8/15/1943 | USA | ITALY | COSENZA | NaN | 3916N | 1615.0 | 39.270000 | 16.250000 | A36 | NaN | 9.00 | SICILY | 37.131022 | 14.321464 |
| 3 | 8/15/1943 | USA | ITALY | GIOJA TAURO | NaN | 3826N | 1554.0 | 38.430000 | 15.900000 | A36 | NaN | 7.50 | SICILY | 37.131022 | 14.321464 |
| 4 | 2/23/1945 | USA | PHILIPPINE ISLANDS | BALETE PASS | WOODED AREA | 1605 | 12055.0 | 16.083333 | 120.916667 | A20 | 1.0 | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 178276 | 12/14/1942 | AUSTRALIA | NEW GUINEA | BUNA AREA - 5 DESTROYERS / MOUTH OF KUMUSI RIVER | ENEMY WARSHIPS | -8.472085 | 148.234293 | -8.472085 | 148.234293 | A20 | 10.0 | 2.25 | NaN | NaN | NaN |
| 178277 | 9/18/1940 | SOUTH AFRICA | ETHIOPIA | YAVELLO | AERODROME | NaN | NaN | 4.883333 | 38.083333 | JU.86 | 2.0 | 0.00 | KENYA | -1.283300 | 36.816700 |
| 178278 | 11/17/1942 | AUSTRALIA | TIMOR | BAUCAU | BUILDINGS | -8.471111 | 126.458333 | -8.471111 | 126.458333 | HUDSON | 8.0 | 2.40 | AUSTRALIA | NaN | NaN |
| 178279 | 2/16/1945 | NEW ZEALAND | NEW IRELAND | KAVIENG SWEEP | VARIOUS TARGETS | NaN | NaN | -2.583333 | 150.833333 | F4U | 8.0 | 4.00 | NaN | NaN | NaN |
| 178280 | 6/11/1942 | AUSTRALIA | NaN | NaN | ENEMY MERCHANT SHIP | -903 | 12432 | -9.050000 | 124.533333 | HUDSON | 1.0 | 0.50 | AUSTRALIA | NaN | NaN |
178281 rows × 15 columns
Now that we have loaded the data lets do some basic analysis! Lets see how many tons of TNT worth of bombs the allies dropped over the corse of the war. We can do this easily by specifying the column containing the total tons of weaponry dropped and calling sum on that column.
total_tons = data["TOTAL_TONS"].sum()
total_tons
4268503.783999999
That is a pretty isane amount of weaponry dropped over the corse of WW2. Now lets try to recreate that graph on the website we got the data from that showed which countries had the most tons of weaponry dropped on them.
# We want to group the data by country being bombed, sum the total tons dropped, and sort by decreasing order. Then we to reset
# the index so we can query by TGT_COUNTRY and then grab the first 5 elements to graph
per_country = (data.groupby(["TGT_COUNTRY"]).sum()).sort_values(by="TOTAL_TONS", ascending=False).reset_index().head(5)
plt.figure(figsize=(20,10)) # Set image size
bar_graph = sns.barplot(x="TGT_COUNTRY",y="TOTAL_TONS", data=per_country)
# This line makes it so we don't use scientific notation for the y axis
bar_graph.ticklabel_format(style="plain", axis="y")
# Set labels
bar_graph.set_xlabel("Target Country")
bar_graph.set_ylabel("Tons of TNT")
bar_graph.set_title("Tons of TNT dropped on each country")
Text(0.5, 1.0, 'Tons of TNT dropped on each country')
Wow! That is a lot of bombs dropped on Germany. Now lets now see who was dropping those bombs. We can do a similar strategey as the last graph except this time we will group by the country flying the mission instead of the target country.
most_dropped = data.groupby(["COUNTRY_FLYING_MISSION"]).sum().sort_values(by=["TOTAL_TONS"],ascending=False).reset_index()
plt.figure(figsize=(20,10))
bar_graph = sns.barplot(x="COUNTRY_FLYING_MISSION",y="TOTAL_TONS", data=most_dropped)
bar_graph.ticklabel_format(style="plain", axis="y")
bar_graph.set_xlabel("Country dropping")
bar_graph.set_ylabel("Tons of TNT")
bar_graph.set_title("Tons of TNT dropped by each country")
Text(0.5, 1.0, 'Tons of TNT dropped by each country')
The USA and Great Britan unsuprisingly did the most dropping of bombs. Now lets find which planes did most of this bomb dropping.
# Lets group the data by aircraft name and see which aircraft has dropped the most tons of bombs
aircrafts = data.groupby(["AIRCRAFT_NAME"]).sum().sort_values(by=["TOTAL_TONS"],ascending=False).reset_index().head(9) # Lets get the top 9 and leave room for an other category
# Create percentage col to use for piechart
aircrafts["PERCENTAGE_DROPPED"] = (aircrafts["TOTAL_TONS"]/total_tons)*100
# Calculate remaining
other_percentage = 100 - aircrafts["PERCENTAGE_DROPPED"].sum()
# Append other to the dataframe
aircrafts = aircrafts.append({"PERCENTAGE_DROPPED" : other_percentage,"AIRCRAFT_NAME" : "Other" }, ignore_index=True)
# plot
plt.figure(figsize=(20,10))
colors = sns.color_palette()[0:10]
plt.pie(aircrafts["PERCENTAGE_DROPPED"], colors=colors, labels=aircrafts["AIRCRAFT_NAME"], autopct="%.1f%%")
plt.title("Percentage of all bombs dropped by aircraft")
plt.show()
The B17 leads this category having dropped almost 30% of all bombs dropped by the US and Great Britan over the war.

This is impressive but not unexpected as the B17 was one of the most mass produced and effictive bombers of the war. Britanica states that the B17 was "was the mainstay of the strategic bombing campaign" for the US.
Now lets try to get a nice overview of the amount of bombing that occured over time.
data["DATE_TIME"] = pd.to_datetime(data["MSNDATE"], format="%m/%d/%Y") # Convert to datetimes for ease of use
by_time = data.groupby(["DATE_TIME"]).sum().reset_index()
plt.figure(figsize=(20,10))
line_graph = sns.lineplot(x="DATE_TIME",y="TOTAL_TONS", data=by_time)
line_graph.set_xlabel("Year")
line_graph.set_ylabel("Tons of TNT")
line_graph.set_title("Tons of TNT over time")
Text(0.5, 1.0, 'Tons of TNT over time')
Hmm what is that small blip we see right before 1941? That seems to be a lot of bombing for so early in the war.
before_1941 = data[data["DATE_TIME"].dt.year < 1941]
outlier = before_1941[before_1941["TOTAL_TONS"] > 4000]
outlier[["TGT_COUNTRY","AIRCRAFT_NAME", "TOTAL_TONS", "TGT_TYPE", "DATE_TIME","AC_ATTACKING"]]
| TGT_COUNTRY | AIRCRAFT_NAME | TOTAL_TONS | TGT_TYPE | DATE_TIME | AC_ATTACKING | |
|---|---|---|---|---|---|---|
| 61118 | SUDAN | WELLESLEY | 4750.0 | FORT | 1940-08-17 | 6.0 |
It turns out this occured in Africa and was related to Northern front, East Africa, 1940. Our data seems to correlate with the wikipedia article under the section about the british attack on fort Gallabat. Our data shows 6 WELLESLEY bombers attacking a fort with around 5000 tons of TNT worth of bombs. Wikipedia states "An RAF contingent of six Wellesley bombers and nine Gladiator fighters were thought sufficient to overcome the 17 Italian fighters and 32 bombers believed to be in range. The infantry assembled 1–2 mi (2–3 km) from Gallabat, whose garrison was unaware that an attack was coming, until the RAF bombed the fort and put the wireless out of action.". An intresting little discovery.
This next section of code works with a slightly more advanced understanding of plotting and dataframe manipulation in pandas. The goal of this cell is to identify a list of all aircraft used during WWII, creating and labeling a scatter plot which will be used to display the amount of TNT (in tons) dropped by each individual aircraft type, over the course of the war. We utilize a clever subplot stacking trick to put multiple plots on the same single plane(pun intended). We then loop through our list of planes, for each plane isolating our dataframe to only contain those rows that relate to the current plane, and for these rows plotting their data points on our graph.
#Create a list of plane names
planes = aircrafts.loc[:,'AIRCRAFT_NAME']
#create a plot object
fig = plt.figure(figsize = (20,10))
#label the plot
plt.title("Tons of Bombs dropped by Aircraft With Respect to time")
plt.xlabel("Year")
plt.ylabel("Tons of TNT")
#utilize subplot stacking to create multiple planes for multiple plots
stack = fig.add_subplot(111)
#iterate over list of planes, plotting respective year and total tons of TNT per plane
for i in planes:
plot_table = data[data["AIRCRAFT_NAME"]==i]
plot_table = plot_table.sort_values(by=['DATE_TIME'])
stack.scatter(plot_table['DATE_TIME'],plot_table['TOTAL_TONS'])
#create legend for list of planes
stack.legend(planes)
<matplotlib.legend.Legend at 0x7f7481bb5790>
Upon first view of the above graph it might seem like it was done incorrectly, and the code does not work the way it is supposed to. This was our initial thought when seeing the clustered results followed by two major outliers far above any other data points. We decided to pull up those particular data points from our dataframe to see what was going on, as these data points made no sense. The following code is what we used to learn more about our very large outliers and determine what our issue could be.
#Isolate the data which appears to be above 14000 total tons in our graph
big = data[data['TOTAL_TONS'] > 14000]
#clean up our dataframe to data that is only relevent to our interpretation
big = big[['COUNTRY_FLYING_MISSION','TGT_COUNTRY','TGT_LOCATION','TGT_TYPE','TOTAL_TONS']]
#print dataframe for visualization
big
| COUNTRY_FLYING_MISSION | TGT_COUNTRY | TGT_LOCATION | TGT_TYPE | TOTAL_TONS | |
|---|---|---|---|---|---|
| 126 | USA | JAPAN | HIROSHIMA | ATOMIC BOMB | 15000.0 |
| 127 | USA | JAPAN | NAGASAKI | ATOMIC BOMB | 20000.0 |
After successfully isolating the two outliers we were able to determine that there was in fact no bug in our code, but in fact an anomoly regarding the two and only atomic bombs ever used on a civilization in all of history. We were surpsied to have forgotten these two events, though they imediately made the data clear and understandable. The events, as listed in the dataframe correlate to the bombings of Hiroshima) and Nagasaki by the united states, in August of 1945, using the catastrophic destruction of the atomic bomb. Bombs Fat man, and Little Boy were dropped, with Fat man being the larger of the two. The Bombs can be seen Below. (Little Boy seen first, Fat Man seen second). Next to them is the b29, the bomber that carried them.

Now lets bring in folium to do some visualizing of what areas were bombed the most. Folium will let us create an interactive heatmap in order to see what areas were most bombed by the allies throughout the war. We are going to do a frequency heatmap overlayed with circle markers that denote the most instensive explosions (top 10,000). These circles will indicat, not to scale, the amount of damage caused in relation to other strikes. We also will label a few key cities to guide your interpretation.
# Important cities LATITUDE, LONGITUDE
berlin = 52.520008,13.404954
tokyo = 35.652832, 139.839478
rome = 41.902782, 12.496366
# Atomic Bombs
Hiroshima = 34.3853, 132.4553
Nagasaki = 32.7503, 129.8779
map_osm = folium.Map(location=berlin, zoom_start=5) #Create a map centered on Berlin
valid_lat = data.groupby(data["LATITUDE"].isnull()).get_group(False) # Make sure latitude is not NaN
valid_lat_long = valid_lat.groupby(data["LONGITUDE"].isnull()).get_group(False) # Make sure longitutde is not NaN
valid = valid_lat_long.groupby(data["TOTAL_TONS"].isnull()).get_group(False)
heat_map_data = []
for index, row in valid.sort_values(by="TOTAL_TONS", ascending=False).reset_index().iterrows():
if index < 10000:
folium.Circle(
location=[row["LATITUDE"],row["LONGITUDE"]],
radius=row["TOTAL_TONS"]*2,
color="crimson",
fill=True
).add_to(map_osm)
heat_map_data.append([row["LATITUDE"],row["LONGITUDE"]])
colormap = {0.3: 'white', 0.6: 'grey', 1: 'black'}
HeatMap(heat_map_data,gradient=colormap).add_to(map_osm) # Add the list of points to the map as a heatmap
# Label the capital Axis cities
folium.Marker(location=berlin,popup="<b>Berlin</b>").add_to(map_osm)
folium.Marker(location=tokyo,popup="<b>Tokyo</b>").add_to(map_osm)
folium.Marker(location=rome,popup="<b>Rome</b>").add_to(map_osm)
folium.Marker(location=Hiroshima,popup="<b>Hiroshima</b>").add_to(map_osm)
folium.Marker(location=Nagasaki,popup="<b>Nagasaki</b>").add_to(map_osm)
# Show map
map_osm